Gathering Sensor Data
- Status: accepted
- Deciders: Artur Wojnar, Andres Lamont
- Date: 2022-08-01
Technical Story: https://orikami.atlassian.net/browse/DBP-455
Context and Problem Statement
In the linked issue, a patient can generate up to 2 GB of data in a daily measurement that is taken once per month. Although this is a requirement for a single study, we expect bigger loads in the future, so it is worth designing an architecture suitable for gathering data from IoT devices.
Decision Drivers
- Scalability: we should be able to spread the load among all of our gateway's instances
- A known format/structure for the sensor data sent to the API, so we don't have to figure out anything on our own
- Use of ready-to-use components and standards
- Do not persist sensor data in Apache Pulsar
- Be able to handle a constant load of sensor data
- Be able to retrieve sensor data from a specific period (e.g. half a year) for analysis purposes
- Do not overburden the Gateway with constant requests
- Offload older data out of the database
- Cut down on the data sent from the client
Considered Options
- Separate time-series collection in MongoDB (version >= 5.2). The granularity should be "seconds", because the mobile client stores the data in batches and sends them every 2 seconds (Reference). Configure the Atlas MongoDB Online Archive (Reference). Attention: Online Archive for time series collections is available as a Preview; the feature and corresponding documentation may change at any time in the Preview stage. We have to configure the Archive rules through our API and the Atlas API, because we store data per tenant and therefore have multiple time series collections (API Reference).
- Communication protocol
- WebSockets to handle the sensor data load. Since Pulsar supports WS, we could proxy the data through the Gateway, which would mean two open sockets per patient: one for the client-gateway connection and one for the gateway-Pulsar connection. Pulsar offers a WS protocol, but it is pointless in our case; as the docs say: "Pulsar WebSocket API provides a simple way to interact with Pulsar using languages that do not have an official client library." (Node.js has a native pulsar-client.) (Reference.) The Gateway has to do JWT authentication: the JWT token is passed through the WS endpoint and validated when the client initializes the connection on the Gateway. With Pulsar's native WS connection we could forward the stream directly from the client to Pulsar, but that requires creating a separate Pulsar connection.
- HTTP/2 | gRPC (client streaming configuration, Reference) + Protobuf. gRPC runs over HTTP/2, which speeds up the data flow (Reference). gRPC streaming allows a limited number of open HTTP/2 streams; the default is 250, but it can be increased up to 5000 (Traefik link). Above the limit, subsequent RPCs are queued (link). Remember to use keep-alive pings. We have to forward the messages from the stream to Pulsar, but we can simply send the binary data we receive, wrapped in an envelope with some metadata, the same way as for the WS option.
- REST API. Listed for documentation purposes only; not seriously considered.
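The "binary data wrapped in an envelope with some metadata" idea from the gRPC option can be sketched as below. The field names (`patientId`, `sentAt`, `payload`) are illustrative assumptions; in practice the envelope would be a Protobuf message compiled from the agreed .proto contract.

```typescript
// Hypothetical envelope for one sensor-data batch. In the real system this
// would be a Protobuf message; here it is a plain TypeScript shape to show
// the intent: opaque binary payload plus routing/metadata fields.
interface SensorEnvelope {
  patientId: string;   // who produced the batch (illustrative name)
  sentAt: number;      // client-side timestamp, epoch milliseconds
  payload: Uint8Array; // opaque sensor batch, forwarded to Pulsar as-is
}

// Wrap a raw batch received on the stream without inspecting its contents.
function wrap(patientId: string, payload: Uint8Array): SensorEnvelope {
  return { patientId, sentAt: Date.now(), payload };
}
```

The Gateway never needs to decode `payload`; it only attaches metadata and forwards, which keeps the hot path cheap.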
Decision Outcome
Chosen option: options 1 and 2b, because we don't have better alternatives. Option 2b because HTTP/2 is still more HTTP than WS, we can send binary data in a contracted way with Protobuf, and we have a protocol-compliant way to deal with OAuth2. Moreover, in the future we can use gRPC for microservice communication, and we can use multiplexing if we need to send more streams from within one client. The Pulsar communication stays untouched; we have an already-created producer that will be used to send the stream. Later on, if further optimizations are needed, we can use a MongoDB sink combined with Pulsar.
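Forwarding the incoming client stream to the already-created Pulsar producer could look roughly like this. The `Producer` interface is a minimal stand-in mirroring the native pulsar-client's `producer.send({ data })` shape; treat the names here as assumptions, not the library's actual types.

```typescript
// Minimal stand-in for a Pulsar producer (the native Node pulsar-client
// exposes a similar send({ data }) method taking a binary payload).
interface Producer {
  send(msg: { data: Uint8Array }): Promise<void>;
}

// Consume the client stream chunk by chunk and hand each chunk to Pulsar.
// Awaiting each send applies backpressure to the incoming stream.
// Returns the number of forwarded messages.
async function forwardStream(
  chunks: AsyncIterable<Uint8Array>,
  producer: Producer,
): Promise<number> {
  let forwarded = 0;
  for await (const chunk of chunks) {
    await producer.send({ data: chunk });
    forwarded += 1;
  }
  return forwarded;
}
```

Because gRPC client streams are async iterables in most Node gRPC wrappers, this loop maps naturally onto the chosen option.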
Positive Consequences
- Significantly improved performance while dealing with streaming data from IoT devices.
Negative Consequences
- A small amount of additional implementation work and enhancements, plus the learning curve of gRPC streaming.
Pros and Cons of the Options
[option 1]
It’s the only reliable option for this purpose. We have to make sure we run MongoDB version >= 5.2 to get the optimizations provided for the Time Series feature, such as compression, which we really need to cut down on size because our data points will be similar to each other, with values differing only by bits.
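A per-tenant time-series collection could be configured roughly as follows. The collection naming scheme and field names (`timestamp`, `sensor`) are assumptions for illustration; the option shape matches MongoDB's `createCollection` time-series options.

```typescript
// Sketch: per-tenant time-series collection options for MongoDB >= 5.2.
interface TimeSeriesCreateOptions {
  timeseries: {
    timeField: string;
    metaField: string;
    granularity: "seconds" | "minutes" | "hours";
  };
}

// Options passed to db.createCollection() for one tenant's collection.
// "seconds" granularity matches the mobile client's 2-second batch interval.
function timeSeriesOptions(): TimeSeriesCreateOptions {
  return {
    timeseries: {
      timeField: "timestamp", // when the batch was recorded (assumed name)
      metaField: "sensor",    // device/sensor identifier, used for bucketing
      granularity: "seconds",
    },
  };
}

// Hypothetical per-tenant naming scheme.
function collectionName(tenantId: string): string {
  return `sensor_data_${tenantId}`;
}

// Usage with the official mongodb driver (not executed here):
// await db.createCollection(collectionName("tenant-a"), timeSeriesOptions());
```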
We may not need old data, so using the built-in Atlas feature is a perfect match for us. The only con is the need for a manual or automated way to configure the Archive rules; each rule is linked to one database and one collection. There can be long-running studies that hold data for a long time. We don’t know much about that case; maybe we will consider a data lake in the future, but for now the Archive rules are enough. Another con is that this feature is still a Preview, but it’s reasonable to expect it to become stable soon.
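Since collections are per tenant, our API would build one Online Archive rule per tenant and submit it through the Atlas Admin API. The payload shape below (`dbName`/`collName`/`criteria`) follows the Atlas Online Archive documentation as we understand it, but it should be treated as an assumption and verified against the current API reference; the naming scheme is illustrative.

```typescript
// Sketch: per-tenant Online Archive rule payload (shape is an assumption;
// verify against the Atlas Admin API reference before relying on it).
interface ArchiveRule {
  dbName: string;
  collName: string;
  criteria: {
    type: "DATE";          // archive based on a date field
    dateField: string;     // field compared against the age threshold
    expireAfterDays: number;
  };
}

function archiveRuleForTenant(
  tenantId: string,
  expireAfterDays: number,
): ArchiveRule {
  return {
    dbName: `tenant_${tenantId}`, // hypothetical per-tenant database naming
    collName: "sensor_data",      // hypothetical collection name
    criteria: { type: "DATE", dateField: "timestamp", expireAfterDays },
  };
}
```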
[option 2a]
Pros:
- Traefik supports WS out of the box (link)
- The solution is stateless
- OAuth can be challenged on connection
- Theoretically, a bigger number of concurrent open connections
Cons:
- No built-in binary protocol like Protobuf
- It's not HTTP
- No built-in OAuth support (requires a tweak)
- Learning curve
- Problems with scalability after initiating a connection
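The "JWT passed through the WS endpoint and validated on connection initialization" step can be sketched as below. Carrying the token in a query parameter (named `token` here) is an assumption, as is the `isWellFormedJwt` stub; a real Gateway would verify the signature and claims with a JWT library against the OAuth2 issuer's keys.

```typescript
// Sketch: pulling a JWT from the WebSocket upgrade request so the Gateway
// can validate it before accepting the connection. Browsers cannot set an
// Authorization header on a WS handshake, so the token is commonly carried
// as a query parameter or subprotocol; the parameter name is an assumption.
function extractToken(requestUrl: string): string | null {
  const url = new URL(requestUrl, "http://placeholder"); // base for relative URLs
  return url.searchParams.get("token");
}

// Placeholder structural check only: a compact JWT has three dot-separated
// parts. Real validation (signature, expiry, audience) needs a JWT library.
function isWellFormedJwt(token: string): boolean {
  return token.split(".").length === 3;
}
```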
[option 2b]
Pros:
- Traefik supports gRPC (HTTP/2) out of the box (link)
- The solution is stateless
- It's still HTTP/2, no new protocol
- Predefined contract for the binary data structure
- Multiplexing
- gRPC can be used later for internal microservice communication (if needed)
- Built-in OAuth2 support
- Suitable for streams and real-time data, with (likely) better performance than WS
Cons:
- Learning curve
[option 2c]
Pros:
- No need for changes in our source code
- Easy
Cons:
- Heavyweight
- Increased latency
- Significantly increased traffic (requests every second)
- Problems with scalability after initiating a connection